
trie: parallelize committer #30461

Closed
wants to merge 5 commits

Conversation

stevemilk
Contributor

Make node commit able to run in parallel, like node hashing in hasher.go.
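
Roughly, the idea is to let a full node's children be committed from separate goroutines, the way hashing is parallelized in hasher.go. Below is a minimal sketch of that pattern, with simplified stand-in types and a plain `parallel` flag rather than the PR's actual code:

```go
// Illustrative sketch only, not the PR's diff: commit the 16 children of a
// full node in parallel, mirroring how hasher.go parallelizes hashing.
package trieexample

import "sync"

type node interface{}

type fullNode struct {
	children [17]node // 16 child slots plus a value slot, as in the trie package
}

// commitChild stands in for the real per-child commit work.
func commitChild(n node) {}

func commitFullNode(fn *fullNode, parallel bool) {
	if !parallel {
		for _, child := range fn.children[:16] {
			if child != nil {
				commitChild(child)
			}
		}
		return
	}
	var wg sync.WaitGroup
	for i := 0; i < 16; i++ {
		child := fn.children[i]
		if child == nil {
			continue
		}
		wg.Add(1)
		go func(c node) {
			defer wg.Done()
			commitChild(c)
		}(child)
	}
	wg.Wait()
}
```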

@karalabe
Member

karalabe commented Sep 18, 2024 via email

@stevemilk
Contributor Author

Got it, will run a benchmark then.

@stevemilk
Contributor Author

Tested on my own machine: parallel mode is 20%-50% faster than single mode while the number of nodes is below 5k; see the picture below.

[screenshot: benchmark chart comparing parallel and single commit times]

Hardware overview:
Total Number of Cores: 10 (8 performance and 2 efficiency)
Memory: 32 GB

@rjl493456442
Member

Benchmark results on my Linux machine:

goos: linux
goarch: amd64
pkg: github.com/ethereum/go-ethereum/trie
cpu: Intel(R) Core(TM) i7-14700K
BenchmarkCommit/commit-100nodes-single-28                  18546             66232 ns/op          115191 B/op       1462 allocs/op
BenchmarkCommit/commit-100nodes-parallel-28                18679             65397 ns/op          115205 B/op       1462 allocs/op
BenchmarkCommit/commit-200nodes-single-28                  10000            124273 ns/op          233517 B/op       2937 allocs/op
BenchmarkCommit/commit-200nodes-parallel-28                14197             92260 ns/op          235108 B/op       2968 allocs/op
BenchmarkCommit/commit-500nodes-single-28                   3777            313145 ns/op          604465 B/op       7606 allocs/op
BenchmarkCommit/commit-500nodes-parallel-28                 6948            336104 ns/op          606177 B/op       7638 allocs/op
BenchmarkCommit/commit-1000nodes-single-28                  1622            623420 ns/op         1174285 B/op      14816 allocs/op
BenchmarkCommit/commit-1000nodes-parallel-28                4302            320331 ns/op         1176365 B/op      14853 allocs/op
BenchmarkCommit/commit-2000nodes-single-28                  1098           1221359 ns/op         2305265 B/op      28850 allocs/op
BenchmarkCommit/commit-2000nodes-parallel-28                2102            600721 ns/op         2306553 B/op      28878 allocs/op
BenchmarkCommit/commit-5000nodes-single-28                   392           3530741 ns/op         6472302 B/op      74933 allocs/op
BenchmarkCommit/commit-5000nodes-parallel-28                 862           1615425 ns/op         6469649 B/op      74926 allocs/op
PASS

@rjl493456442
Member

[screenshots: benchmark dashboard charts]

Deployed the PR/master on benchmark nodes 07/08 for one hour; it turns out this pull request does improve the account trie commit a bit.

@holiman
Contributor

holiman commented Sep 19, 2024

Deployed the PR/master on benchmark nodes 07/08 for one hour; it turns out this pull request does improve the account trie commit a bit.

I also looked at those charts a bit. I'm a bit confused. Where is snapshot account read, and snapshot storage read? I guess we have not updated both side-by-side charts, only one, after changing the metrics?

@holiman
Contributor

holiman commented Sep 19, 2024

Also relevant (this PR starts about halfway in, time-wise)
[screenshot: Dual Geth Grafana dashboard]

@rjl493456442
Member

@holiman Yes, I just deployed it on an ongoing benchmark pair (new release vs last release) to have a quick test.

@rjl493456442
Member

I also looked at those charts a bit. I'm a bit confused. Where is snapshot account read, and snapshot storage read?

The snapshot account read and snapshot storage read metrics were deleted in the state reader abstraction PR. Now you have to use account read and storage read instead.

we have not updated both side-by-side charts, only one, after changing the metrics

True, we need to update them

@fjl fjl changed the title trie: parallize committer trie: parallelize committer Sep 26, 2024
}
}

type wrapNode struct {
Contributor

The whole concept of wrapping nodes -- I don't see the point in it. Why is that needed? Couldn't you just copy the path for each goroutine, and then let each goroutine work on its own path copy individually without risking any cross-goroutine disruptions?

Contributor Author

Because multiple goroutines calling AddNode will cause concurrent map writes.
This could be solved by using a mutex, but wrapping the nodes avoids the lock/unlock and saves a little more time.
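
For comparison, the mutex-based alternative mentioned above would look roughly like the sketch below; lockedNodeSet and its AddNode are hypothetical, simplified stand-ins for the real trienode.NodeSet:

```go
// Hypothetical sketch of the mutex alternative: every committing goroutine
// takes the lock before inserting into the shared map, which is safe but
// pays lock/unlock overhead on every single node.
package trieexample

import "sync"

type lockedNodeSet struct {
	mu    sync.Mutex
	nodes map[string]any // committed nodes keyed by their trie path
}

func newLockedNodeSet() *lockedNodeSet {
	return &lockedNodeSet{nodes: make(map[string]any)}
}

// AddNode can safely be called from multiple committing goroutines.
func (s *lockedNodeSet) AddNode(path []byte, n any) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.nodes[string(path)] = n
}
```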

Contributor

Well, if we're parallelizing things, you could have a chan where you send off the nodes, and then have a dedicated goroutine which just reads the chan and invokes AddNode sequentially. If you make the chan buffered, then it should be faster than using a lock.
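
A minimal sketch of that channel pattern, with hypothetical names (pathedNode, commitSubtrees, commitOne) rather than anything from the PR:

```go
// Hypothetical sketch of the buffered-channel idea: workers send (path, node)
// pairs and a single dedicated goroutine drains the channel and calls addNode
// sequentially, so the underlying map never sees concurrent writes.
package trieexample

import "sync"

type pathedNode struct {
	path []byte
	n    any
}

func commitSubtrees(paths [][]byte, commitOne func([]byte, chan<- pathedNode), addNode func([]byte, any)) {
	out := make(chan pathedNode, 128) // buffering keeps workers from blocking on every send
	done := make(chan struct{})

	// Dedicated collector: the only caller of addNode.
	go func() {
		for pn := range out {
			addNode(pn.path, pn.n)
		}
		close(done)
	}()

	var wg sync.WaitGroup
	for _, p := range paths {
		wg.Add(1)
		go func(p []byte) {
			defer wg.Done()
			commitOne(p, out) // each worker sends the nodes it committed
		}(p)
	}
	wg.Wait()
	close(out) // all workers done; let the collector drain the remaining nodes
	<-done     // wait until every node has actually been added
}
```

Note the two pieces of bookkeeping this needs: a small struct as the channel's element type, and a signal that all sends have finished.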

Contributor Author

Yes, I did try this approach before. However, it still requires a new struct, similar to wrapNode, as the chan's element type, since AddNode requires both the node itself and the path. Additionally, we would need an extra mechanism to detect when all sends have completed. Here's the old draft PR illustrating this approach, rough version.

As for parallelizing the committer while invoking AddNode sequentially, the current PR can achieve that in a cleaner way while being just as fast. I believe both approaches are valid and can work well, and I'm open to further discussion if you think there's a strong case for the alternative.

@holiman
Contributor

holiman commented Oct 3, 2024

I've attempted to do it differently.

Concentrating only on the larg(er) trie, here are my numbers for this PR:

BenchmarkCommit/commit-5000nodes-single-8                      110           13174997 ns/op         6467652 B/op      74881 allocs/op
BenchmarkCommit/commit-5000nodes-parallel-8                    212            5879169 ns/op         6471346 B/op      74935 allocs/op

And the unmodified master, with a similar benchmark:

// BenchmarkCommit/commit-5000nodes-single-8         	       144      	   9601554 ns/op	     4070863 B/op	   51615 allocs/op

Thus: for non-parallel, this PR is a degradation, most likely because of the additional mem usage, which goes up by ~50%. So speed goes from 9ms -> 13ms for single-threaded, whereas the parallel path goes from 9ms to 6ms.

The memory overhead is what I saw in the graphs above, on the live benchmark nodes.

In my branch parallel_commit_alt, I've attempted a different approach (branch against upstream master, diff against your branch):

  • Get rid of the wrapped node struct
  • At the top level, spin out new committers, one per goroutine. Each committer has its own internal nodeset. After finishing its work, each goroutine merges its own nodeset with the parent nodeset.

With that approach, I get

BenchmarkCommit/commit-5000nodes-single-8                    122          10880174 ns/op         4623681 B/op      55263 allocs/op
BenchmarkCommit/commit-5000nodes-parallel-8                  331           3701590 ns/op         4726417 B/op      55604 allocs/op

So: 9ms -> 11ms for single-threaded (but I suspect that might be a fluke), and 4ms for the multi-threaded approach. The mem usage is only marginally higher than the current code.

Note: I have not ascertained that the implementation actually is correct, i.e. that the nodes collected are the right ones. So the next step would be to actually test that it does what it's supposed to do. But all in all, I don't think the current +50% mem usage is a workable approach.
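
For reference, a rough sketch of that per-goroutine-committer-and-merge shape, with hypothetical types rather than the actual parallel_commit_alt code:

```go
// Rough sketch: each goroutine commits its subtree into a private nodeset,
// then merges that set into the parent once it is done.
package trieexample

import "sync"

type nodeSet struct {
	nodes map[string]any
}

func newNodeSet() *nodeSet { return &nodeSet{nodes: make(map[string]any)} }

// merge folds a child set into this one. It is only called after the child
// goroutine has finished committing, so the commit work itself is lock-free.
func (s *nodeSet) merge(child *nodeSet) {
	for path, n := range child.nodes {
		s.nodes[path] = n
	}
}

type committer struct {
	set *nodeSet
}

// commitSubtree stands in for the real per-subtree commit work; results go
// into the committer's private set, so no synchronization is needed here.
func (c *committer) commitSubtree(path []byte) {
	c.set.nodes[string(path)] = struct{}{} // placeholder for the committed node
}

// commitParallel spins out one committer per subtree. The only contention
// point is the final merge into the parent set, guarded by a single mutex.
func commitParallel(paths [][]byte, parent *nodeSet) {
	var (
		wg sync.WaitGroup
		mu sync.Mutex
	)
	for _, p := range paths {
		wg.Add(1)
		go func(p []byte) {
			defer wg.Done()
			c := &committer{set: newNodeSet()}
			c.commitSubtree(p)
			mu.Lock()
			parent.merge(c.set)
			mu.Unlock()
		}(p)
	}
	wg.Wait()
}
```

The trade-off is one merge pass per goroutine into the parent set, instead of a lock (or a channel send) per AddNode.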

@holiman
Contributor

holiman commented Oct 3, 2024

Fixed up now so it is correct. New numbers on my branch (basically no change):

BenchmarkCommit/commit-5000nodes-single-8                    122          11983636 ns/op         4623249 B/op      55258 allocs/op
BenchmarkCommit/commit-5000nodes-parallel-8                  261           4930507 ns/op         5592596 B/op      55802 allocs/op

@holiman
Contributor

holiman commented Oct 13, 2024

Closing this in favour of #30545. Thanks for picking it up from the start!

@holiman holiman closed this Oct 14, 2024